Abstract — Supporting the quality of service of unlicensed 
users in cognitive radio networks is very challenging, mainly 
due to dynamic resource availability because of the licensed 
users' activities. In this paper, we study the optimal admission 
control and channel allocation decisions in cognitive overlay 
networks in order to support delay sensitive communications of 
unlicensed users. We formulate it as a Markov decision process 
problem, and solve it by transforming the original formulation 
into a stochastic shortest path problem. We then propose a 
simple heuristic control policy, which includes a threshold-based 
admission control scheme and a largest-delay-first channel 
allocation scheme, and prove the optimality of the 
largest-delay-first channel allocation scheme. We further propose an 
improved policy using the rollout algorithm. By comparing the 
performance of both proposed policies with the upper-bound of 
the maximum revenue, we show that our policies achieve close- 
to-optimal performance with low complexities. 

Index Terms — Admission control, Markov decision process, 
Bellman's equation, rollout algorithm 

I. Introduction 

Cognitive radio technology has the potential to significantly 
improve spectrum utilization and accommodate many more 
devices in the limited spectrum. Supporting Quality of Service 
(QoS), however, is challenging in cognitive radio networks due 
to the dynamically changing network resources. In this paper, 
we will design an admission control and channel allocation 
mechanism to support delay-sensitive real-time secondary un- 
licensed communications. Compared with the resource allo- 
cation in conventional communication networks, the unique 
challenge here is to incorporate the impact of primary licensed 
users on the availability of the communication resources. 

Optimal channel selection of a single secondary unlicensed 
user has been well studied in the literature (e.g., [2], [3]). 
Zhao et al. [2] considered the total expected reward 
maximization problem when the secondary user can only sense 
one channel at a time. Liu et al. [3] further considered the 
case where the secondary user can sense multiple channels 
simultaneously. The resource allocation problem becomes more 
complicated when there are multiple secondary users (e.g., [4], 
[5]). Zhou et al. [4] jointly considered channel allocation with 
power control. Urgaonkar and Neely [5] developed opportunistic 
scheduling policies to provide performance guarantees. 

Admission control is critical for supporting QoS when 
too many users want to access the network 
simultaneously. In traditional cellular networks, many results 
have shown that the optimal admission control policy has a 
threshold structure (e.g., [6]-[8]). In cognitive radio networks, 
researchers have studied admission control for both underlay 
networks (e.g., [9]-[11]) and overlay networks (e.g., [12], 
[13]). In cognitive overlay networks, admission control is 
often jointly pursued with channel allocation, as the secondary 
users can only access idle channels not occupied by primary 
users. Admission control can also be jointly considered with 
other mechanisms. For example, Kim and Shin [12] considered joint 
optimal admission and eviction control using a semi-Markov 
decision process and linear programming. Mutlu et al. [13] 
investigated the problem of optimal spot pricing of spectrum 
for maximizing the profit from the admission of secondary 
users. 

In this paper, we consider the joint admission control and 
channel allocation problem for cognitive overlay networks. 
Our problem is very different from the throughput max- 
imization for elastic data traffic studied in most previous 
literature Q], 0. We want to support the secondary users' 
real-time applications (e.g., VoIP and video streaming) with 
stringent delay constraints. 

The rest of the paper is organized as follows. We describe 
the system model in Section II and formulate the admission 
control and channel allocation problem as a Markov Decision 
Process (MDP) in Section III. In Section IV, we transform 
the problem into a stochastic shortest path problem and 
prove the convergence of the Bellman's equation. Section V 
proposes a heuristic control policy and an improved rollout 
policy, together with the corresponding theoretical analysis and 
simulation results. We finally conclude in Section VI. 

II. System Model 

This paper studies a cognitive radio network as shown in 
Fig. 1. We consider an infrastructure-based secondary unlicensed 
network, where a secondary network operator senses 
the channel availabilities (i.e., primary licensed users' 
activities) and decides the admission control and channel allocation 
for the secondary users. A similar network architecture has 
been considered in several recent works (e.g., [14]-[17]). 
Compared with the distributed network architecture where 
end users need to perform spectrum sensing individually, the 
network architecture considered in this paper has the advantage 
of reducing the complexity of the secondary user devices 
and providing better QoS support. Such an infrastructure-based 
network without a user sensing requirement is also consistent 
with the recent ruling of the FCC (Federal Communications 
Commission) on TV white space sharing [18]. 

Fig. 1. A cognitive radio network scenario. In the secondary network, the 
dotted arrows denote the channels between the secondary base station and the 
secondary users. 

Fig. 2. The components of a time slot. 

One way to realize network-based spectrum sensing is to 
construct a sensor network that is dedicated to sensing the 
radio environment in space and time [19]. The secondary 
base station will collect the sensing information from the 
sensor network and provide it to the unlicensed users, which 
is called "sensing as service". There have been significant 
research efforts along this direction in the context of the 
European project SENDORA [20], which aims at developing 
techniques based on sensor networks for supporting the 
coexistence of licensed and unlicensed wireless users in the same area. 

In our model, time is divided into equal-length slots. 
Primary users' activities remain roughly unchanged within a 
single time slot. This means that it is enough for the operator 
to sense once at the beginning of each time slot (see Fig. 2). 
Readers interested in optimizing the time slot length to 
balance sensing and data transmission are referred to [21]. 

The network has a set $\mathcal{J} = \{1,\ldots,J\}$ of orthogonal 
primary licensed channels. The state of each channel follows 
a Markovian ON/OFF process as in Fig. 3. If a channel is 
"ON", then it means that the primary user is not active on the 
channel and the channel condition is good enough to support 
the transmission rate requirement of a secondary user. Here we 
assume that all secondary users want to achieve the same target 
transmission rate (e.g., that of the same type of video streaming 
application). If a channel is "OFF", then either a primary user 
is active on this channel, or the channel condition is not good 
enough to achieve the secondary user's target rate. In the 
time-slotted system, the channel state changes from "ON" to "OFF" 
("OFF" to "ON", respectively) between adjacent time slots 
with a probability $p$ ($q$, respectively). When a channel is "ON", 
it can be used by a secondary unlicensed user. 

Fig. 3. Markovian ON/OFF model of channel activities. 
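
As a small illustration of the ON/OFF model above, the following Python sketch computes the long-run fraction of time a single channel is "ON" and the "ON" probability after a given number of slots. The function names are hypothetical and not from the paper; the closed form $q/(p+q)$ is the standard stationary distribution of a two-state Markov chain with ON-to-OFF probability $p$ and OFF-to-ON probability $q$.

```python
def stationary_on_probability(p, q):
    """Long-run probability that a channel is ON.

    p: probability of switching ON -> OFF between adjacent slots
    q: probability of switching OFF -> ON between adjacent slots
    """
    return q / (p + q)


def on_probability_after(k, p, q, start_on=True):
    """Probability that the channel is ON after k slots, given its start state."""
    prob_on = 1.0 if start_on else 0.0
    for _ in range(k):
        # Stay ON with probability 1 - p, or switch from OFF to ON with probability q.
        prob_on = prob_on * (1 - p) + (1 - prob_on) * q
    return prob_on
```

With $p = q = 0.5$ (the values used in the figures of Section V), a channel is available half of the time in the long run.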

We consider an infinitely backlogged case, where there are 
many secondary users who want to access the idle channels. 
Each idle channel can be used by at most one secondary user 
at any given time. A secondary user represents an unlicensed 
user communicating with the secondary base station as shown 
in Fig. 1. The secondary users are interested in real-time 
applications such as video streaming and VoIP, which require 
steady data rates with stringent delay constraints. The key QoS 
parameter is the accumulative delay, which is the total delay 
that a secondary user experiences after it is admitted into the 
system. Once a secondary user is admitted into the network, 
it may finish the session normally with a certain probability. 
However, if the user experiences an accumulative delay larger 
than a threshold, then its QoS significantly drops (e.g., freezing 
happens for video streaming) and the user will be forced to 
terminate. 

To make the analysis tractable, we make several assump- 
tions. First, we assume that the availabilities of all channels 
follow the same Markovian model. This is reasonable if 
the traffic types of different primary users are similar (e.g., 
all primary users are voice users). Second, we assume that 
all secondary users experience the same channel availability 
independent of their locations. This is reasonable when the 
secondary users are close-by. Third, we assume the spectrum 
sensing is error-free. This can be well approximated 
by having enough sensors performing collaborative sensing. 
Furthermore, we assume that all channels are homogeneous 
and can provide the same data rate to any single secondary 
user using any channel. Finally, we assume that all secondary 
users are homogeneous (i.e., interested in the same application 
such as video streaming). Each secondary user only requires 
one available channel to satisfy its rate requirement. Several 
of the above assumptions can be relaxed by increasing the 
state space of the MDP formulation. As we will see shortly, 
the admission control and channel allocation issue in this 
homogeneous case is already quite complicated and admits no 
closed-form solutions. The analysis and insights of this paper 
will enable us to further consider heterogeneous channels and 
secondary users in the future. 

III. Problem Formulation 

We formulate the admission control and channel allocation 
problem as an MDP [22]. In an infinite-horizon MDP with a 
finite set of states $\mathcal{S}$, the state evolves through time according 
to a transition probability matrix $\{P_{x_k x_{k+1}}\}$, which depends 
on both the current state and the control decision from a set 
$\mathcal{U}$. More specifically, if the network is in state $x_k$ in time slot 
$k$ and selects a decision $u(x_k) \in \mathcal{U}(x_k)$, then the network 
obtains a revenue $g(x_k, u(x_k))$ in time slot $k$ and moves to 
state $x_{k+1}$ in time slot $k+1$ with probability $P_{x_k x_{k+1}}(u(x_k))$. 
We want to maximize the long-term time average revenue, i.e., 

$$\lim_{T\to\infty} \frac{1}{T} E\left\{ \sum_{k=0}^{T-1} g(x_k, u(x_k)) \right\}. \tag{1}$$



A. The State Space 

The system state describes system information after the 
network performs spectrum sensing at the beginning of the 
time slot (see Fig. 2). It consists of two components: 

• A channel state component, $m = \mathbf{a}^T \cdot \mathbf{a}$, describes the 
number of available channels. Here $\mathbf{a} = (a_j, \forall j \in \mathcal{J})$ is 
the channel availability vector, where $a_j = 1$ (or 0) when 
channel $j$ is available (or not). 

• A user state component, $\omega_e = \{\omega_{e,i}, \forall i \in \mathcal{D}\}$, describes 
the numbers of secondary users with different accumulative 
delays. Here $\mathcal{D} = \{0, 1, \ldots, D_{\max}\}$ is the set of 
possible delays, and $\omega_{e,i}$ denotes the number of secondary 
users whose accumulative delay is $i$. 

We let $\mathcal{M}$ denote the feasible set of the channel 
state component, and $\Omega$ denote the feasible set of the 
user state component. The state space is given by 
$\mathcal{S} = \{(m, \omega_e) \mid m \in \mathcal{M}, \omega_e \in \Omega\}$. 

State $\theta$ is said to be accessible from state $\eta$ if and only if it is 
possible to reach state $\theta$ from $\eta$, i.e., 
$P\{\text{reach } \theta \mid \text{start in } \eta\} > 0$ [23]. 
Two states that are accessible to each other are said to 
communicate with each other. In our formulation, 
all the states in space $\mathcal{S}$ are accessible from state $\mathbf{0}$, which 
is defined as the state where there is no available channel and 
no admitted secondary user in the system. Since it is 
possible to have $m = 0$ in several consecutive time slots (when 
primary traffic is heavy and occupies all channels), state 
$\mathbf{0}$ is accessible from any state in the state space $\mathcal{S}$. Hence, 
all the states communicate with each other and the Markov 
chain is irreducible. Finally, the state space is finite, so all the 
states are positive recurrent [23]. This property turns out to 
be critical for the analysis in Section IV. 

B. The Control Space 

For the state $x_k = \{m, \omega_e\} \in \mathcal{S}$ in each time slot $k$, the set 
of available control choices $\mathcal{U}(x_k)$ depends on the relationship 
between the channel state and the user state. The control vector 
$u(x_k) = \{u_a, u_e\}$ consists of two parts: scalar $u_a$ denotes the 
number of admitted new secondary users, and vector 
$u_e = \{u_{e,i}, \forall i \in \mathcal{D}\}$ denotes the numbers of secondary users who 
are allocated channels and have accumulative delays of $i \in \mathcal{D}$ 
at the beginning of the current time slot. Without loss of 
generality, we assume $0 \le u_a \le J$, i.e., we will never admit 
more secondary users than the total number of channels. This 
leads to $0 \le u_{e,0} \le \omega_{e,0} + u_a$, $0 \le u_{e,i} \le \omega_{e,i}$ for all 
$i \in [1, D_{\max}]$, and $0 \le \sum_{i=0}^{D_{\max}} u_{e,i} \le m$. Since $m \le J$, the 
cardinality of the control space $\mathcal{U}$ is $J^{D_{\max}+2}$. 

C. The State Transition 

The current state $x_k = \{m, \omega_e\} \in \mathcal{S}$ together with the control 
$u(x_k) \in \mathcal{U}(x_k)$ determines the probability of reaching the next 
state $x_{k+1} = \{m', \omega'_e\}$. 

First, the transition of the channel state component from $m$ 
to $m'$ depends on the underlying primary traffic. We can 
divide the $m'$ available channels into two groups: one group 
contains $m'_1$ channels which are available in the (current) 
time slot $k$, and the other group contains $m'_2$ channels which 
are not available in time slot $k$. Let us define the set 
$\mathcal{Z} = \{(m'_1, m'_2) \mid m' = m'_1 + m'_2,\ 0 \le m'_1 \le m,\ 0 \le m'_2 \le J - m\}$. 
Then we can calculate the probability based on the i.i.d. ON/OFF model 
in Section II: 

$$P_{mm'} = \sum_{(m'_1, m'_2) \in \mathcal{Z}} \binom{m}{m'_1} (1-p)^{m'_1} p^{m - m'_1} \binom{J-m}{m'_2} q^{m'_2} (1-q)^{J - m - m'_2}. \tag{2}$$

Thus the channel transition function is $f_s(m) = m'$ with 
probability $P_{mm'}$ for all $m' \in \mathcal{M}$. 
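
The two-group decomposition above can be evaluated numerically. The following is a minimal Python sketch, assuming (as in Section II) that every channel follows the same independent ON/OFF chain with ON-to-OFF probability `p` and OFF-to-ON probability `q`. The function name `channel_transition_prob` is a hypothetical helper, not part of the paper.

```python
from math import comb


def channel_transition_prob(m, m_next, J, p, q):
    """Probability that m ON channels in this slot become m_next ON channels.

    Sums over all splits (m1, m2) with m1 + m2 = m_next, where m1 of the
    m currently ON channels stay ON and m2 of the J - m OFF channels turn ON.
    """
    total = 0.0
    for m1 in range(0, min(m, m_next) + 1):
        m2 = m_next - m1
        if m2 > J - m:
            continue  # not enough OFF channels to supply m2 newly ON channels
        stay_on = comb(m, m1) * (1 - p) ** m1 * p ** (m - m1)
        turn_on = comb(J - m, m2) * q ** m2 * (1 - q) ** (J - m - m2)
        total += stay_on * turn_on
    return total
```

A quick sanity check is that each row of the resulting $(J+1) \times (J+1)$ transition matrix sums to one, since the two groups evolve as independent binomial random variables.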

Let us define $\omega_c = \{\omega_{c,i}, \forall i \in \mathcal{D}\}$ as the numbers of 
secondary users who normally complete their connections (not 
due to delay violation) in time slot $k$. For example, a user may 
terminate a video streaming session after the movie finishes, 
or terminate a VoIP session when the conversation is over. If 
we assume that all users have the same completion probability 
$P_f$ per slot when they are actively served, then the event of 
having $\rho$ out of $r$ users completing their connections (denoted 
as $f_c(r) = \rho$) happens with probability $\binom{r}{\rho} P_f^{\rho} (1 - P_f)^{r - \rho}$. 
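
The completion model is a binomial distribution over the served users. A minimal Python sketch, with the hypothetical helper name `completion_probability` and the argument `rho` standing for the number of completing users:

```python
from math import comb


def completion_probability(rho, r, p_f):
    """Probability that exactly rho of r served users complete in one slot,
    when each served user completes independently with probability p_f."""
    return comb(r, rho) * p_f ** rho * (1 - p_f) ** (r - rho)
```

For instance, with $P_f = 0.01$ and three served users, the probability that nobody completes in a slot is $0.99^3 \approx 0.9703$.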

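Finally, define $\omega_q$ as the number of secondary users who 
are forced to terminate their connections during time slot $k$. 
The state transition can be written as 

$$\begin{cases}
m' = f_s(m), \\
\omega_{c,i} = f_c(u_{e,i}), \ \forall i \in \mathcal{D}, \\
\omega_q = \omega_{e,D_{\max}} - u_{e,D_{\max}}, \\
\omega'_{e,0} = u_{e,0} - \omega_{c,0}, \\
\omega'_{e,1} = u_{e,1} + (\omega_{e,0} + u_a - u_{e,0}) - \omega_{c,1}, \\
\omega'_{e,i} = u_{e,i} + (\omega_{e,i-1} - u_{e,i-1}) - \omega_{c,i}, \ \forall i \in [2, D_{\max}].
\end{cases} \tag{3}$$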



Let us take a network with $J = 10$ and $D_{\max} = 2$ 
as a numerical example. In a particular time slot, assume 
that there are $m = 7$ channels available and a total of 6 
secondary users admitted in the system: 1 user with zero 
accumulative delay, 3 users with 1 time slot of accumulative 
delay, and 2 users with 2 time slots of accumulative delay. 
Then the state vector is $\{m, \omega_e\} = \{7, (1,3,2)\}$. Assume 
the control decision is to admit 2 new users and to allocate 
available channels to all users except one of the new users, 
i.e., $u = \{u_a, u_e\} = \{2, (2,3,2)\}$. Thus if no user 
completes a connection in the current time slot and there are $m' = 4$ 
available channels in the next time slot, the system state 
becomes $\{m', \omega'_e\} = \{4, (2,4,2)\}$. 
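
The user-state part of this example can be replayed with a short Python sketch. It assumes the following reading of the transition rule: newly admitted users join with zero delay, a served user keeps its accumulative delay, an unserved user's delay grows by one slot, and unserved users already at $D_{\max}$ are forced to terminate. The function name and argument layout are hypothetical.

```python
def next_user_state(m, omega_e, u_a, u_e, omega_c):
    """One-slot update of the user state (assumed reading of the transition rule).

    m:       number of available channels in this slot
    omega_e: users per accumulative delay 0..D_max at the start of the slot
    u_a:     number of newly admitted users (they start with zero delay)
    u_e:     users per delay class that are allocated a channel
    omega_c: normal completions per delay class, among the served users
    Returns (omega_e_next, omega_q), where omega_q counts forced terminations.
    """
    d_max = len(omega_e) - 1
    present = list(omega_e)
    present[0] += u_a
    if sum(u_e) > m:
        raise ValueError("more users served than available channels")
    if any(s < 0 or s > n for s, n in zip(u_e, present)):
        raise ValueError("cannot serve more users than present in a delay class")
    if any(c < 0 or c > s for c, s in zip(omega_c, u_e)):
        raise ValueError("only served users can complete normally")

    unserved = [n - s for n, s in zip(present, u_e)]
    omega_q = unserved[d_max]  # unserved users at D_max violate the delay bound
    omega_next = [u_e[i] - omega_c[i] for i in range(d_max + 1)]
    for i in range(1, d_max + 1):
        omega_next[i] += unserved[i - 1]  # unserved users age by one slot
    return omega_next, omega_q
```

Calling `next_user_state(7, [1, 3, 2], 2, [2, 3, 2], [0, 0, 0])` reproduces the user state $(2,4,2)$ of the example with no forced termination.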

D. The Objective Function 

Our system optimization objective is to choose the optimal 
control decision for each possible state to maximize the 
expected average revenue per time slot (also called stage), i.e., 

$$\max \lim_{T\to\infty} \frac{1}{T} E\left\{ \sum_{k=0}^{T-1} g(x_k, u(x_k)) \right\}. \tag{4}$$



Here the revenue function is computed at the end of each time 
slot $k$ as follows: 

$$g(x_k, u(x_k)) = R_c \sum_{i=0}^{D_{\max}} \omega_{c,i}(k) + R_t \sum_{i=0}^{D_{\max}} \omega_{e,i}(k) - C_q\, \omega_q(k), \tag{5}$$

where $R_c > 0$ is the reward of completing the connection of 
a secondary user normally (without violating the maximum 
delay constraint), $R_t > 0$ is the reward of maintaining the 
connection of a secondary user, and $C_q > 0$ is the penalty of 
forcing a connection to terminate. By choosing different values 
of $R_c$, $R_t$, and $C_q$, a network designer can achieve different 
objective functions. In this paper, we assume that the values 
of $R_c$, $R_t$, and $C_q$ are given parameters. 
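
As a hedged illustration of how the three parameters trade off, the sketch below evaluates a per-slot revenue of the form "completion reward plus maintenance reward minus termination penalty". It assumes that the maintenance reward $R_t$ is paid for every user counted in the user state of that slot; the function name `slot_revenue` is hypothetical.

```python
def slot_revenue(omega_c, omega_e, omega_q, r_c, r_t, c_q):
    """Per-slot revenue under the assumed reading of the revenue function.

    omega_c: normal completions per delay class in this slot
    omega_e: users per delay class counted for the maintenance reward
    omega_q: number of forced terminations in this slot
    """
    return r_c * sum(omega_c) + r_t * sum(omega_e) - c_q * omega_q
```

With $R_c = 10$, $R_t = 1$, and $C_q = 10$, one normal completion, seven maintained users, and one forced termination give a revenue of $10 + 7 - 10 = 7$, so a single forced termination cancels the reward of a normal completion.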

IV. Analysis of the MDP Problem 

We define a sequence of control actions as a policy, 
$\mu = \{u(x_0), u(x_1), \ldots\}$, where $u(x_k) \in \mathcal{U}(x_k)$ for all $k$. A policy 
is stationary if the choice of decision only depends on the state 
and is independent of the time. Let 

$$V_\mu(\theta) = \lim_{T\to\infty} \frac{1}{T} E\left\{ \sum_{k=0}^{T-1} g(x_k, u(x_k)) \,\Big|\, x_0 = \theta \right\}$$

be the expected average revenue per stage starting from state $\theta$ under policy $\mu$. Our 
objective is to find the best policy $\mu^*$ to optimize the average 
revenue per stage starting from an initial state $\theta$. 

Section III-A shows that any state can be visited from 
any other state within a finite number of stages under a stationary policy.¹ 
Moreover, since the revenue $g(x_k, u(x_k)) < \infty$ for all $x_k$ and 
$u$, we have 

$$\lim_{T\to\infty} \frac{1}{T} E\left\{ \sum_{k=0}^{K} g(x_k, u(x_k)) \right\} = 0 \tag{6}$$

for any finite $K$. Therefore, we have the following proposition 
from our prior preliminary results [1]. 

Proposition 1: For any stationary policy, the average rev- 
enue per stage is independent of the initial state. 

Next we give the following detailed proof of the proposition. 

1 A policy is stationary if the choice of decision only depends on the state 
and is independent of the time. 




Fig. 4. Transition probability of the shortest path problem. 



Proof: Since the revenue $g(x_k, u(x_k)) < \infty$ for all $x_k$ 
and $u$, we have 

$$\lim_{T\to\infty} \frac{1}{T} E\left\{ \sum_{k=0}^{K} g(x_k, u(x_k)) \right\} = 0 \tag{7}$$

for any finite value of $K$. Consider a stationary policy $\mu$ whose 
control decision only depends on the state of the system. 
According to the MDP formulation, all the states are positive 
recurrent. So starting in state $\theta$, the process will visit state 
$\eta$ infinitely often, and the expected time that the process 
takes to visit state $\eta$ from state $\theta$ is finite [23]. Thus, any state in 
the state space can be visited from any other state within 
a finite number of stages under the stationary policy. Therefore, 
we assume that, under the policy $\mu$, the state $\eta \in \mathcal{S}$ is visited 
from the state $\theta \in \mathcal{S}$. Let $K_{\theta\eta}(\mu)$ be the number of time slots 
until the system first reaches state $\eta$ from state $\theta$ under policy 
$\mu$. Then the average revenue per stage corresponding to the initial 
condition $x_0 = \theta$ can be expressed as 

$$V_\mu(\theta) = \lim_{T\to\infty} \frac{1}{T} E\left\{ \sum_{k=0}^{K_{\theta\eta}(\mu)-1} g(x_k, u(x_k)) \right\} + \lim_{T\to\infty} \frac{1}{T} E\left\{ \sum_{k=K_{\theta\eta}(\mu)}^{T-1} g(x_k, u(x_k)) \right\}. \tag{8}$$

The first term in (8) is zero according to (7), while the second 
limit is equal to $V_\mu(\eta)$. So with $E\{K_{\theta\eta}(\mu)\} < \infty$, 

$$V_\mu(\theta) = V_\mu(\eta) = V_\mu \tag{9}$$

for any two states $\theta$ and $\eta$. ■ 
As shown in Proposition 1, the average revenue per stage 
under any stationary policy is independent of the initial state, 
and the average revenue maximization problem can be 
transformed into a stochastic shortest path problem. More 
specifically, we pick a state $n$ as the start state of the stochastic 
shortest path problem, and define an artificial termination state 
$t$ from the state $n$. The transition probability from an arbitrary 
state $\theta$ to the termination state $t$ satisfies $P_{\theta t}(\mu) = P_{\theta n}(\mu)$, 
as shown in Fig. 4. 

In the stochastic shortest path problem, we define $-g(n, \mu)$ 
as the expected stage cost incurred at state $n$ under policy $\mu$. 
Let $\lambda^*$ be the optimal average revenue per stage starting from 
the state $n$ to the terminal state $t$, and let $\lambda^* - g(n, \mu)$ be the 
normalized expected stage cost. Then the normalized expected 
terminal cost from the state $x_0 = n$ under the policy $\mu$, 
$h_\mu(n) = \lim_{N\to\infty} E\left\{ \sum_{k=0}^{N-1} \{\lambda^* - g(x_k, u(x_k))\} \right\}$, is zero 
when the policy $\mu$ is optimal. The cost minimization in the 
stochastic shortest path problem is equivalent to the original 
average revenue per stage maximization problem. Let $h^*(\theta)$ 
denote the optimal cost of the stochastic shortest path starting 
at state $\theta \in \mathcal{S}$. Then we get the corresponding Bellman's 
equation as follows [22]: 

$$h^*(\theta) = \min_{\mu} \left[ \lambda^* - g(\theta, \mu) + \sum_{\eta \in \mathcal{S}} P_{\theta\eta}(\mu)\, h^*(\eta) \right], \quad \theta \in \mathcal{S}. \tag{10}$$

If $\mu^*$ is a stationary policy that maximizes the cycle revenue, 
we have the following equations: 

$$h^*(\theta) = \lambda^* - g(\theta, \mu^*) + \sum_{\eta \in \mathcal{S}} P_{\theta\eta}(\mu^*)\, h^*(\eta), \quad \theta \in \mathcal{S}. \tag{11}$$

The Bellman's equation is an iterative way to solve MDP 
problems. Next we show that solving the Bellman's equation 
(10) in the stochastic shortest path problem leads to the optimal 
solution. 

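Proposition 2: For the stochastic shortest path problem, 
given any initial values of terminal costs $h_0(\theta)$ for all states 
$\theta \in \mathcal{S}$, the sequence $\{h_l(\theta), l = 1, 2, \ldots\}$ generated by the 
iteration 

$$h_{l+1}(\theta) = \min_{\mu} \left[ \lambda^* - g(\theta, \mu) + \sum_{\eta \in \mathcal{S}} P_{\theta\eta}(\mu)\, h_l(\eta) \right], \quad \theta \in \mathcal{S}, \tag{12}$$

converges to the optimal terminal cost $h^*(\theta)$ for each state $\theta$. 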

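Proof: For an arbitrary state $\theta$ and an admissible policy $\mu$, 
there exists an integer $\gamma$ satisfying $P\{x_\gamma \neq t \mid x_0 = \theta, \mu\} < 1$ 
[24]. Let $\rho = \max_{(\theta,\mu)} P\{x_\gamma \neq t \mid x_0 = \theta, \mu\}$; then $\rho < 1$ and 
$P\{x_{2\gamma} \neq t \mid x_0 = \theta, \mu\} = P\{x_{2\gamma} \neq t \mid x_\gamma \neq t, x_0 = \theta, \mu\} \cdot P\{x_\gamma \neq t \mid x_0 = \theta, \mu\} \le \rho^2$. 
Therefore, we get $P\{x_{K\gamma} \neq t \mid x_0 = \theta, \mu\} \le \rho^K$. 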

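We break down the cost $h_\mu(x_0)$ into the portions incurred 
over the first $K\gamma$ time slots ($K$ is a positive integer) and over 
the remaining time slots, i.e., 

$$h_\mu(x_0) = \lim_{N\to\infty} E\left\{ \sum_{k=0}^{N-1} \{\lambda^* - g(x_k, u(x_k))\} \right\} = E\left\{ \sum_{k=0}^{K\gamma-1} \{\lambda^* - g(x_k, u(x_k))\} \right\} + \lim_{N\to\infty} E\left\{ \sum_{k=K\gamma}^{N-1} \{\lambda^* - g(x_k, u(x_k))\} \right\}. \tag{13}$$

Define $\Gamma = \gamma \max_{(\theta,\mu)} |\lambda^* - g(\theta, \mu)|$, which denotes the 
upper bound on the cost of a $\gamma$-slot cycle when termination 
does not occur during the cycle. Then, the expected cost during 
the $K$-th $\gamma$-slot cycle (time slots $K\gamma$ to $(K+1)\gamma - 1$) is upper 
bounded by $\rho^K \Gamma$, so that 

$$\left| h_\mu(x_0) - E\left\{ \sum_{k=0}^{K\gamma-1} \{\lambda^* - g(x_k, u(x_k))\} \right\} \right| = \left| \lim_{N\to\infty} E\left\{ \sum_{k=K\gamma}^{N-1} \{\lambda^* - g(x_k, u(x_k))\} \right\} \right| \le \Gamma \sum_{k=K}^{\infty} \rho^k = \frac{\rho^K \Gamma}{1-\rho}. \tag{14}$$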

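Let $h_0(x_0)$ be a terminal cost function as defined in the 
proposition. Then its expected value under $\mu$ after $K\gamma$ 
time slots is bounded by 

$$\left| E\{h_0(x_{K\gamma})\} \right| = \left| \sum_{\theta \in \mathcal{S}} P(x_{K\gamma} = \theta \mid x_0, \mu)\, h_0(\theta) \right| \le \left( \sum_{\theta \in \mathcal{S}} P(x_{K\gamma} = \theta \mid x_0, \mu) \right) \max_{\theta \in \mathcal{S}} |h_0(\theta)|. \tag{15}$$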



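Since the probability that $x_{K\gamma} \neq t$ is less than or equal to $\rho^K$ 
for any policy, we have $|E\{h_0(x_{K\gamma})\}| \le \rho^K \max_{\theta \in \mathcal{S}} |h_0(\theta)|$. 
Therefore, we can get 

$$-\rho^K \max_{\theta \in \mathcal{S}} |h_0(\theta)| + h_\mu(x_0) - \frac{\rho^K \Gamma}{1-\rho} \le E\left\{ h_0(x_{K\gamma}) + \sum_{k=0}^{K\gamma-1} \{\lambda^* - g(x_k, u(x_k))\} \right\} \le \rho^K \max_{\theta \in \mathcal{S}} |h_0(\theta)| + h_\mu(x_0) + \frac{\rho^K \Gamma}{1-\rho}. \tag{16}$$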

The expected value in the middle term of the above inequalities 
is the $K\gamma$-slot cost of policy $\mu$ starting from state $x_0$ with 
a terminal cost $h_0(x_{K\gamma})$. The minimum of this cost over 
all $\mu$ is equal to the value $h_{K\gamma}(x_0)$, which is generated by 
the dynamic programming recursion (12) after $K\gamma$ iterations. 
Thus, by taking the minimum over $\mu$ in (16), we obtain for 
all $x_0$ and $K$, 

$$-\rho^K \max_{\theta \in \mathcal{S}} |h_0(\theta)| + h^*(x_0) - \frac{\rho^K \Gamma}{1-\rho} \le h_{K\gamma}(x_0) \le \rho^K \max_{\theta \in \mathcal{S}} |h_0(\theta)| + h^*(x_0) + \frac{\rho^K \Gamma}{1-\rho}. \tag{17}$$



By taking the limit as $K \to \infty$, the terms involving $\rho^K$ 
go to zero, and we obtain $\lim_{K\to\infty} h_{K\gamma}(x_0) = h^*(x_0)$ 
for all $x_0$. In addition, since $|h_{K\gamma+q}(x_0) - h_{K\gamma}(x_0)| \le \rho^K \Gamma$ 
for $q = 1, 2, \ldots, \gamma - 1$, we have $\lim_{K\to\infty} h_{K\gamma+q}(x_0) = 
\lim_{K\to\infty} h_{K\gamma}(x_0) = h^*(x_0)$ for all $q = 1, \ldots, \gamma - 1$. ■ 
Proposition 2 shows that solving the Bellman's equation 
leads to the optimal average revenue $\lambda^*$ and the optimal 
differential cost $h^*$. The Bellman's equation can often be 
solved using value iteration or policy iteration algorithms; 
details can be found in [24] and [25]. Once we have $\lambda^*$ and 
$h^*$, we can compute the optimal control decision $u^*(\theta)$ that 
minimizes the immediate differential cost of the current stage 
plus the remaining expected differential cost for state $\theta$, i.e., 

$$u^*(\theta) = \arg\min_{\mu} \left[ \lambda^* - g(\theta, \mu) + \sum_{\eta \in \mathcal{S}} P_{\theta\eta}(\mu)\, h^*(\eta) \right]. \tag{18}$$
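
To make the iteration in Proposition 2 and the greedy decision rule concrete, here is a minimal Python sketch on a generic finite MDP. It is an illustration under strong assumptions and not the paper's implementation: the average revenue `lam` is taken as known (in practice $\lambda^*$ has to be found together with $h^*$, for example by relative value iteration or policy iteration, which this sketch does not do), transitions into the reference state `ref` are treated as transitions to the termination state with zero cost, and the data layout `P[s][a][s_next]`, `g[s][a]` is hypothetical.

```python
def ssp_value_iteration(P, g, lam, ref, iters=200):
    """Iterate h <- min_a [lam - g(s,a) + sum_{s' != ref} P(s,a,s') h(s')].

    P[s][a] is a list of next-state probabilities, g[s][a] the stage revenue.
    Mass flowing into `ref` is redirected to the termination state (cost 0).
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(iters):
        # The comprehension reads the old h; the new list replaces it afterwards.
        h = [
            min(
                lam - g[s][a] + sum(P[s][a][t] * h[t] for t in range(n) if t != ref)
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
    return h


def greedy_policy(P, g, lam, ref, h):
    """Pick, in every state, the action attaining the minimum for the given h."""
    n = len(P)
    policy = []
    for s in range(n):
        costs = [
            lam - g[s][a] + sum(P[s][a][t] * h[t] for t in range(n) if t != ref)
            for a in range(len(P[s]))
        ]
        policy.append(costs.index(min(costs)))
    return policy
```

On a toy two-state example (state 0 has two actions, state 1 has one) whose optimal average revenue is 1, the iteration with `lam = 1.0` and `ref = 0` drives the cost of the reference state to zero, matching the statement that the normalized terminal cost vanishes under an optimal policy.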



V. Suboptimal Control and Dynamic Programming 

Solving the Bellman's equation does not lead to a closed- 
form optimal control policy, and the iterative computation is 
time-consuming for our problem with a large state space. To 
resolve this issue, a broad class of suboptimal control methods 
referred to as approximate dynamic programming (ADP) have 
been proposed in El . Next we first propose a simple heuristic 
control policy in Section V-A. Then in Section V-B, we 
will improve the performance of the heuristic algorithm by 
using the idea of the rollout algorithm (which is a class of ADP 
algorithms). It is known that the suboptimal policy based on 
the rollout algorithm is identical to the policy obtained by a 
single policy improvement step of the classical policy iteration 
method [24], [25]. 

A. Heuristic Control Policy 

Several observations can help us with the suboptimal algo- 
rithm design. First, the channel state transitions are determined 
by the underlying primary traffic and are not affected by any 
control policy. Second, all secondary users experience the 
same channel availability independent of their locations, and 
all channels are homogeneous and provide the same data rates. 
This means that we are interested in how many users to admit 
rather than who to admit, and we only care how many channels 
are available instead of which are available. This motivates 
us to first consider admission control and channel allocation 
separately. 

For the admission control, we first consider a simple 
threshold-based strategy, where a new user will be admitted if 
and only if the total number of admitted users is smaller than 
the threshold. Given a fixed admission control threshold Th, 
there are many ways of performing the channel allocation. To 
resolve this issue, we propose the largest-delay-first strategy, 
which allocates available channels to admitted users with the 
largest accumulated delay first. 
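
The heuristic described above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: it assumes an infinitely backlogged system (so there are always users waiting to be admitted), admits users until the number in the system reaches the threshold `n_th`, and then serves delay classes from the largest accumulated delay downwards. The function name `heuristic_control` is hypothetical.

```python
def heuristic_control(m, omega_e, n_th):
    """Threshold-based admission plus largest-delay-first channel allocation.

    m:       number of available channels in this slot
    omega_e: admitted users per accumulative delay 0..D_max
    n_th:    admission threshold on the total number of admitted users
    Returns (u_a, u_e): number of new admissions and served users per delay class.
    """
    u_a = max(0, n_th - sum(omega_e))  # top up to the threshold
    present = list(omega_e)
    present[0] += u_a                  # new users start with zero delay

    u_e = [0] * len(present)
    free = m
    for i in range(len(present) - 1, -1, -1):  # largest delay first
        u_e[i] = min(present[i], free)
        free -= u_e[i]
    return u_a, u_e
```

With `m = 7`, `omega_e = [1, 3, 2]`, and `n_th = 8`, the sketch returns `(2, [2, 3, 2])`, which coincides with the control used in the numerical example of Section III-C. With only 4 available channels, the two users with the largest delay are served first and the new users wait.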

Proposition 3: The largest-delay-first channel allocation 
policy is optimal under any fixed threshold-based admission 
control policy. 

Proof: Under a threshold-based admission control policy, 
the number of admitted users in the system is constant in any 
time slot. The objective function in (4) is equal to the 
maximization problem $\max E\{g(x, u(x))\}$ due to the ergodicity of 
the instant revenue $g(x_k, u(x_k))$. Let $\Omega_c = E\left\{ \sum_{i=0}^{D_{\max}} \omega_{c,i} \right\}$ 
be the expected number of normally completed users at the 
end of each time slot, $\Omega_e = E\left\{ \sum_{i=0}^{D_{\max}} \omega_{e,i} \right\}$ be the expected 
number of users in the network at the end of each time slot, and 
$\Omega_q = E\{\omega_q\}$ be the expected number of forcefully terminated 
users at the end of each time slot. Then 

$$\max E\{g(x, u(x))\} = \max\{R_c \Omega_c + R_t \Omega_e - C_q \Omega_q\}. \tag{19}$$

Under the threshold-based admission control policy, 
$\Omega_e = \sum_{i=0}^{D_{\max}} \omega_{e,i}$ in every time slot $k$ and equals the threshold. 

In the largest-delay-first policy, let $L_c$ be the expected length 
of a normally completed session, $D_c$ the expected delay of a 
normally completed session, and $L_q$ the expected length of a 
forcefully terminated session. Now let us consider an arbitrary 
channel allocation policy as the benchmark, and we use the 
superscript $(g)$ to denote all parameters corresponding to this 
particular channel allocation policy, i.e., $\Omega_c^{(g)}$, $\Omega_e^{(g)}$, $\Omega_q^{(g)}$, $L_c^{(g)}$, 
$D_c^{(g)}$, and $L_q^{(g)}$. We will show that the largest-delay-first policy 
is no worse than this benchmark policy, which will prove the 
proposition. 

Because all actively served users have the same completion 
probability $P_f$ independent of the channel allocation decisions, 
we can show that $\Omega_c = \Omega_c^{(g)}$, $\Omega_e = \Omega_e^{(g)}$, and $L_c = L_c^{(g)}$. 
Since the largest-delay-first policy always allocates available 
channels to the secondary users with the largest delay, we have 
$D_c \ge D_c^{(g)}$. 

Here comes the critical proof step. We consider $\Omega_e$ virtual 
channels, one for each user in the network. If the secondary 
user is allocated an available physical channel, then its virtual 
channel is "idle" in that time slot; otherwise its virtual channel 
is "busy" and causes a delay. In the long run (when $T \to \infty$), 
we have the following: 

$$\Omega_e \cdot T = \Omega_c T (L_c + D_c) + \Omega_q T (L_q + D_{\max}) = \Omega_c^{(g)} T (L_c^{(g)} + D_c^{(g)}) + \Omega_q^{(g)} T (L_q^{(g)} + D_{\max}). \tag{20}$$

Based on the relationships we just derived in the previous 
paragraph, we have 

$$\Omega_q (L_q + D_{\max}) \le \Omega_q^{(g)} (L_q^{(g)} + D_{\max}). \tag{21}$$

Since the number of available channels is the same under the 
two channel allocation policies in any time slot, we have 

$$\Omega_c T L_c + \Omega_q T L_q = \Omega_c^{(g)} T L_c^{(g)} + \Omega_q^{(g)} T L_q^{(g)}. \tag{22}$$

Since $\Omega_c = \Omega_c^{(g)}$ and $L_c = L_c^{(g)}$, (22) implies that 
$\Omega_q L_q = \Omega_q^{(g)} L_q^{(g)}$. Together with inequality (21), we have $\Omega_q \le \Omega_q^{(g)}$. 

Because $\Omega_c = \Omega_c^{(g)}$, $\Omega_e = \Omega_e^{(g)}$, and $\Omega_q \le \Omega_q^{(g)}$, we 
have $R_c \Omega_c + R_t \Omega_e - C_q \Omega_q \ge R_c \Omega_c^{(g)} + R_t \Omega_e^{(g)} - C_q \Omega_q^{(g)}$, 
i.e., $\max E\{g(x_k, u(x_k))\} \ge \max E\{g^{(g)}(x_k, u(x_k))\}$. This 
shows that our proposed largest-delay-first channel allocation 
policy is no worse than any other channel allocation algorithm, and 
thus is optimal with a threshold-based admission control. ■ 

For performance comparison, we further define two bench- 
mark channel allocation strategies. 

• Strategy 1: allocate the available channels to the admitted 
users with the smallest accumulated delays. If there is a 
tie, break it randomly. 

• Strategy 2: allocate the available channels to the admitted 
users randomly. 

In Fig. 5 and Fig. 6, we compare the proposed channel 
allocation policy and the two benchmark policies for different 
total number of channels. All three policies follow the same 
threshold-based admission control policies. From these figures, 
we observe that the proposed largest-delay-first policy is no 
worse than the other two under all choices of parameters. 

Fig. 5. Revenue versus threshold of three different strategies ($J = 5$, 
$D_{\max} = 5$; $p = 0.5$, $q = 0.5$, $R_c = 10$, $R_t = 1$, $C_q = 10$, $P_f = 0.01$). 

Fig. 6. Maximum expected revenue comparison with different numbers of 
channels $J$ for the proposed strategy, Strategy 1, and Strategy 2 
($p = 0.5$, $q = 0.5$, $R_c = 10$, $R_t = 1$, $C_q = 10$, $P_f = 0.01$). 



B. Rollout Control Policy 

The heuristic algorithm proposed in Section V-A can be 
further improved by the rollout algorithm. The general 
background of the rollout algorithm is in Appendix A. In this 
subsection, based on the analysis of the heuristic control 
policy, we propose a simplified rollout algorithm (rollout 
control policy) to further improve the performance. 

Consider two different user states $\omega_e^{(1)}$ and $\omega_e^{(2)}$ that have 
the same number of secondary users. If it is possible to 
transit from state $\omega_e^{(1)}$ to state $\omega_e^{(2)}$ under a particular channel 
condition without admitting any new user, then obviously the 
total time delay of $\omega_e^{(1)}$ summed over all users must be less 
than that of $\omega_e^{(2)}$ (as each user either has the same delay or 
a larger delay during the transition). We give the following 
definitions: 

Definition 1 (User State Comparison): Consider two 
different user states $\omega_e^{(1)}$ and $\omega_e^{(2)}$ that have the same number 
of secondary users. If it is possible to transit from state $\omega_e^{(1)}$ 
to state $\omega_e^{(2)}$ under a particular channel condition without 
admitting any new user, then $\omega_e^{(1)}$ is better than $\omega_e^{(2)}$, denoted 
$\omega_e^{(1)} \succ \omega_e^{(2)}$. 

Definition 2 (Quality of Channel State): Consider a user 
state $\omega_e^{(1)}$ and a channel state $m$. The channel state $m$ is 
B (Bad) for the user state $\omega_e^{(1)}$ if and only if $m$ is less than 
the total number of users in $\omega_e^{(1)}$. Otherwise, the channel state 
$m$ is G (Good) for the user state $\omega_e^{(1)}$. 

Now consider a heuristic control policy with the admission 
control threshold $N_{th}$ and the largest-delay-first channel allocation 
mechanism. Under this policy, we can divide the infinite-horizon 
process into an infinite number of segments separated by 
the time slots in which there is at least one user leaving the 
system (normal completion or forced termination). Then we 
can define a new average revenue $g(N_{th}, \theta)$ and its expectation 
$G(N_{th}, \theta)$ over each segment. Due to the threshold-based 
admission control, we will only admit new users in the first 
slot of a segment. 

Definition 3 (Average Revenue and Expected Average Revenue): 
If the network state is $\theta$ at the beginning of time slot $k$, and 
at least one user leaves the system for the first time (normal 
completion or forced termination) in time slot $k + \delta$, we 
define the average revenue over the period $[k, k + \delta]$ as 

$$g(N_{th}, \theta) = \frac{n_c(N_{th}, \theta)}{\delta + 1} R_c - \frac{n_d(N_{th}, \theta)}{\delta + 1} C_q + N_{th} R_t, \qquad (23)$$

where $n_c(N_{th}, \theta)$ is the number of users completing connections 
normally in time slot $k + \delta$, and $n_d(N_{th}, \theta)$ is the number of users 
being forced to terminate in time slot $k + \delta$. The expected 
average revenue is denoted as 

$$G(N_{th}, \theta) = E\{g(N_{th}, \theta)\} = N_c(N_{th}, \theta) R_c - N_d(N_{th}, \theta) C_q + N_{th} R_t, \qquad (24)$$

where $N_c(N_{th}, \theta) = E\left\{\frac{n_c(N_{th}, \theta)}{\delta + 1}\right\}$ and $N_d(N_{th}, \theta) = E\left\{\frac{n_d(N_{th}, \theta)}{\delta + 1}\right\}$. 

The expected revenue in Definition 3 is different from the 
instant revenue in (5). The expected revenue is defined under 
a very special case, where no new users are admitted except 
in the first time slot and no users leave the network except in 
the last time slot of the interval. The instant revenue defined 
in (5) is the revenue for a generic time slot. Furthermore, 
$G(N_{th}, \theta)$ represents the expected average revenue per time 
slot when maintaining a fixed number of users until someone 
leaves. Although the precise value of $G(N_{th}, \theta)$ is hard to 
compute explicitly, we have the following result as a corollary 
of Proposition 3. 

Proposition 4: Given any fixed $N_{th}$ and $\theta$, the 
largest-delay-first channel allocation policy achieves the maximum 
$G(N_{th}, \theta)$. 

Based on Proposition 4, we will still use the largest-delay-first 
channel allocation strategy. The key remaining issue 
is how to improve the admission control policy. Next we 
characterize the properties of the largest-delay-first channel 
allocation policy (the expected average revenue $G(N_{th}, \theta)$ in 
the heuristic control policy) in several lemmas, which enable 
us to design a better heuristic algorithm for the admission 
control part. 

According to the lemmas given in Appendix B, we can 
characterize $G(N_{th}, \theta)$ as follows. 

Proposition 5: $G(N_{th}, \theta)$ is a concave function of $N_{th}$. 
Proof: The second-order derivative of $G(N_{th}, \theta)$ with respect 
to $N_{th}$ is 

$$G''(N_{th}, \theta) = N_c''(N_{th}, \theta) R_c - N_d''(N_{th}, \theta) C_q, \qquad (25)$$

where $N_c''(N_{th}, \theta) \le 0$ and $N_d''(N_{th}, \theta) \ge 0$ based on Lemma 1 
and Lemma 2 in Appendix B. Thus we have $G''(N_{th}, \theta) \le 0$, 
i.e., $G(N_{th}, \theta)$ is a concave function of $N_{th}$. ■ 

Fig. 7. The values of $G(N_{th}, \theta)$ versus $N_{th}$ corresponding to different 
values of $R_c$ and $C_q$ when $J = 10$, $p = 0.5$, $q = 0.5$, $D_{max} = 5$, 
$P_f = 0.01$, $R_t = 0.7$ and $\theta = \{m, [0, 0, 0, 0, 0, 0]\}$. Horizontal axis: number of users ($N_{th}$). 

Figure 7 plots $G(N_{th}, \theta)$ versus $N_{th}$ with fixed $\theta = 
\{m, [0, 0, 0, 0, 0, 0]\}$ and different values of $R_c$ and $C_q$. 

Now we are ready to discuss the heuristic admission control 
policy. Given a state $\theta = (m, \omega_e)$, the admission control 
decision can be either maintaining or searching, depending 
on the relationship between the channel state component $m$ 
and the user state component $\omega_e$. More precisely, if $m$ is B 
(Bad) for $\omega_e$, the network coordinator maintains the 
current user population and does not admit any new user (i.e., 
maintaining). This is because the network resource is not 
enough to support the current users, and admitting new users 
will make the situation worse. If $m$ is G (Good) for $\omega_e$, 
the network coordinator first searches for the value $N_{th}^*$ 
that achieves $\max_{N_{th}} G(N_{th}, \theta)$ (i.e., searching), and then 
admits a number of users equal to the difference between 
$N_{th}^*$ and the current number of users in the network. Proposition 5 
shows that $G(N_{th}, \theta)$ has a unique maximizer $N_{th}^*$ (with a 
fixed state $\theta$), and implies a simple stopping rule for the 
numerical search: if we have $G(N'_{th} - 1, \theta) < G(N'_{th}, \theta)$ and 
$G(N'_{th}, \theta) > G(N'_{th} + 1, \theta)$, then $N_{th}^* = N'_{th}$. 
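The maintaining/searching decision and the stopping rule can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the callable `G` (standing in for $G(N_{th}, \theta)$ at a fixed state), and the cap `n_max` on the number of users are assumptions made here, not part of the paper. The search starts from the current number of users because users already in the system are not removed by the admission decision.

```python
from typing import Callable


def search_threshold(G: Callable[[int], float], n_start: int, n_max: int) -> int:
    """Return a maximizer of a concave G over the integers n_start..n_max.

    Climbs while the next value is strictly better; for a concave G the
    first point where this fails satisfies the stopping rule in the text.
    """
    n = n_start
    while n < n_max and G(n + 1) > G(n):
        n += 1
    return n


def admission_decision(m: int, n_users: int,
                       G: Callable[[int], float], n_max: int) -> int:
    """Return how many new users to admit in state (m, user state).

    m       : number of available channels (channel state)
    n_users : number of secondary users currently in the system
    n_max   : hypothetical upper limit on the searched threshold
    """
    if m < n_users:
        # Channel state is B (Bad): maintaining, admit nobody.
        return 0
    # Channel state is G (Good): searching.
    n_star = search_threshold(G, n_users, n_max)
    return n_star - n_users


if __name__ == "__main__":
    # Toy concave surrogate for G(N_th, theta), for illustration only.
    toy_G = lambda n: -(n - 4) ** 2
    print(admission_decision(m=6, n_users=2, G=toy_G, n_max=10))
```

Each evaluation of `G` in practice would be the simulated estimate described in the text, so the linear climb costs one batch of simulations per candidate threshold.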

The heuristic admission control introduced above is a 
rollout control policy based on the theory in Appendix A. 
More specifically, the value of $\max_{N_{th}} G(N_{th}, \theta)$ computed 
in the searching step is the cost-to-go starting from a state 
$\theta$. As Proposition 5 shows that this is a concave 
maximization problem, we can use several well-known 
numerical methods to solve it. One possibility is the 
gradient descent method, which has a linear convergence rate as 
shown in [26]. More precisely, the maximum number of 
iterations needed for the gradient descent method to converge is proportional 
to $\log\big((G(N_{th}^{initial}, \theta) - G(N_{th}^{final}, \theta))/\epsilon\big)$, where $\epsilon$ is the 
stopping criterion. Since the precise value of $G(N_{th}, \theta)$ is 
hard to compute with a low complexity, we use an 
approximation $\tilde{G}(N_{th}, \theta)$ instead in the searching step. In 
this paper, we use an on-line computation (simulation) to get 
$\tilde{G}(N_{th}, \theta)$. More specifically, for each choice of $(N_{th}, \theta)$, 
we obtain the value of $g(N_{th}, \theta)$ as in (23) for each 
particular simulation, and take the average over many 
simulations to obtain the approximation $\tilde{G}(N_{th}, \theta)$. The memory 
requirements are proportional to the expected length of the 
segments separated by the time slots in which there is at least 
one user leaving the system (normal completion or forced 
termination). 
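The averaging step can be illustrated with a minimal Python sketch. The per-segment revenue follows (23); everything else is an assumption made for illustration: the function names, the `simulate_segment` callback that returns $(n_c, n_d, \delta)$ for one simulated segment, and in particular `toy_segment`, which is a deliberately simplified stand-in (independent completions with probability `p_f`, no channel dynamics, no forced terminations) and not the system model of the paper.

```python
import random
from typing import Callable, Tuple

Segment = Tuple[int, int, int]  # (n_c, n_d, delta)


def segment_revenue(n_c: int, n_d: int, delta: int, n_th: int,
                    R_c: float, C_q: float, R_t: float) -> float:
    """Average revenue g over one segment of delta + 1 slots, as in (23)."""
    return (n_c * R_c - n_d * C_q) / (delta + 1) + n_th * R_t


def estimate_G(simulate_segment: Callable[[int, random.Random], Segment],
               n_th: int, R_c: float, C_q: float, R_t: float,
               runs: int = 1000, seed: int = 0) -> float:
    """Monte Carlo approximation of G: the mean of g over simulated segments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        n_c, n_d, delta = simulate_segment(n_th, rng)
        total += segment_revenue(n_c, n_d, delta, n_th, R_c, C_q, R_t)
    return total / runs


def toy_segment(n_th: int, rng: random.Random,
                p_f: float = 0.01, max_slots: int = 10_000) -> Segment:
    """Hypothetical segment model: each user completes w.p. p_f per slot."""
    for delta in range(max_slots):
        n_c = sum(rng.random() < p_f for _ in range(n_th))
        if n_c > 0:
            return n_c, 0, delta  # segment ends at the first departure
    return 0, 0, max_slots - 1   # safety cap if nobody ever leaves


if __name__ == "__main__":
    print(estimate_G(toy_segment, n_th=5, R_c=10.0, C_q=10.0, R_t=1.0, runs=200))
```

Only one segment is held in memory at a time, which matches the remark that the memory requirement scales with the expected segment length rather than with the number of simulation runs.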

C. Revenue Boundary 

In this subsection, we compare the performance of the two 
heuristic policies that we have proposed. Before that, we 
establish an upper bound on the revenue achievable under any 
control policy (heuristic or optimal). We call this bound the 
revenue boundary. 

We first prove the following property of the expected 
average revenue $G(N_{th}, \theta)$. 

Proposition 6: For a fixed number of users $N_{th}$, if there 
are two states $\theta_1 = \{m, \omega_e^{(1)}\}$ and $\theta_2 = \{m, \omega_e^{(2)}\}$ such that 
$\omega_e^{(1)} > \omega_e^{(2)}$, we have $G(N_{th}, \theta_1) \ge G(N_{th}, \theta_2)$. 
Proof: According to Lemma 3 in Appendix B, we have 
$N_c(N_{th}, \theta_1) \ge N_c(N_{th}, \theta_2)$ and $N_d(N_{th}, \theta_1) \le N_d(N_{th}, \theta_2)$. 
By substituting them into (24), we get $G(N_{th}, \theta_1) \ge 
G(N_{th}, \theta_2)$. ■ 
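The substitution step in this proof can be written out explicitly. The short derivation below is a sketch that assumes $R_c \ge 0$ and $C_q \ge 0$ (a reward and a penalty, respectively), which is consistent with the parameter values used in the figures but is stated here as an assumption.

```latex
\begin{align*}
G(N_{th},\theta_1) - G(N_{th},\theta_2)
  &= \bigl[N_c(N_{th},\theta_1) - N_c(N_{th},\theta_2)\bigr] R_c
   - \bigl[N_d(N_{th},\theta_1) - N_d(N_{th},\theta_2)\bigr] C_q \\
  &\ge 0 ,
\end{align*}
% The N_{th} R_t terms of (24) cancel because N_{th} is the same in both states.
% The first bracket is nonnegative and the second is nonpositive by Lemma 3.
```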

Then we can characterize the revenue boundary. 

Proposition 7: Consider a network state $\theta = 
\{m, [0, 0, 0, \cdots]\}$, where there are $m$ available 
channels. The maximum expected revenue per time 
slot achieved by any policy, denoted by $G_{max}$, satisfies 
$G_{max} \le \max_{m, N_{th}} \{G(N_{th}, \theta)\}$, where $N_{th}$ is an admission 
control threshold. 

Proof: Assume $\xi = \{m, \omega_e^{\xi}\}$ and $\eta = \{m, \omega_e^{\eta}\}$ are 
two network states with $m$ available channels and $N_{th}$ users, 
where $\omega_e^{\xi} = [N_{th}, 0, 0, \cdots]$. If $\omega_e^{\eta} \ne \omega_e^{\xi}$, we have $\omega_e^{\xi} > 
\omega_e^{\eta}$. From Proposition 6, we get $G(N_{th}, \xi) \ge G(N_{th}, \eta)$. In 
addition, after the control decision in the first time slot, $N_{th}$ 
new secondary users are admitted in the case of $\theta$ (since there 
are originally no users in the system), and no new secondary 
user is admitted in the case of $\xi$ (since there are already $N_{th}$ 
users with zero accumulative delay in the system). Thus, after 
the first time slot, we reach the same state in both cases. In 
the following time slots, the expected changes in the two cases 
are thus the same. Therefore, according to the definition of $G$ 
in Definition 3, we have $G(N_{th}, \theta) = G(N_{th}, \xi)$. Therefore, 
the cost-to-go we compute in the searching step is never larger 
than $\max_{m, N_{th}} \{G(N_{th}, \theta)\}$. As the optimal policy can be 
viewed as a special case of the rollout policy by using the 
optimal policy as the base policy, it follows that the expected 
revenue per time slot of any policy (including the optimal one) 
is no larger than $\max_{m, N_{th}} \{G(N_{th}, \theta)\}$. ■ 

In Section II, we have assumed perfect spectrum sensing. 
Under this assumption, the control policy of the throughput 
maximization problem studied in [4] can be simplified 
to admitting secondary users so as to make full use of the available 
channels in each time slot, which we call greedy admission 
control in this paper. Such a greedy admission control policy 
admits new users whenever possible, so that the total number of 
active users in a time slot equals the number of available 
channels. Compared with our proposed policy, this greedy 
policy is more aggressive and does not consider future channel 
availability, and thus leads to a larger number 
of forced terminations. We have plotted the expected revenue 
of the greedy admission control in Fig. 8, together 
with our proposed admission control policies and the revenue 
boundary. We can see that even our proposed 
heuristic control policy performs better than the greedy 
control policy. The heuristic control policy (with the 
threshold-based admission control) is simple but effective, while the 
rollout algorithm achieves a slightly better performance but 
with a much higher computational complexity. The actual 
performance gap between the proposed algorithms and the 
optimal policy could be even smaller, as the revenue boundary 
in Proposition 7 may not be very tight. 

Fig. 8. Expected revenue comparison between the greedy control policy, the 
heuristic control policy, the rollout control policy, and the revenue boundary 
when $D_{max} = 5$, $p = 0.5$, $q = 0.5$, $R_c = 10$, $R_t = 1$, $C_q = 10$, 
$P_f = 0.01$ and $J \in [5, 10]$. Horizontal axis: number of channels ($J$). 

VI. Conclusions 

Supporting QoS over cognitive radio networks is very chal- 
lenging, mainly due to the uncertainty of available communi- 
cation resources. As one further step towards understanding 
this under-explored yet practically important research area, 
we considered supporting delay-sensitive traffic in cognitive 
radio networks. The key is to jointly optimize admission 
control and channel allocation, in order to balance the number 
of concurrent sessions and the QoS of each session. We 
formulated the problem as an infinite-horizon Markov decision 
process problem, and proved that the optimal average revenue 
is independent of the initial system state. Then we transformed 
the original problem into a stochastic shortest path problem, 
and proved that Bellman's equation converges to the 
optimal policy. Furthermore, we proposed a heuristic control 
policy and proved that the largest-delay-first strategy is optimal 
given threshold-based admission control. We further proposed 
a rollout algorithm that improves upon the heuristic algorithm 
by performing dynamic admission control. By comparing with a 
revenue bound, we showed that both of our proposed algorithms 
achieve close-to-optimal performance. 

Appendix 

A. Rollout Algorithm 

For convenience, we consider the finite-horizon stochastic 
shortest path problem as a discrete-time dynamic system 

$$x_{k+1} = f(x_k, u(x_k), \zeta_k), \quad k = 0, 1, \cdots. \qquad (26)$$

According to the definitions in Section III, $x_k$ is the state 
(belonging to the state space $S$) at time slot $k$, $u(x_k)$ is the control 
selected from the control space $U$ at time $k$, $\zeta_k$ is a random 
disturbance caused by the activities of the users at time $k$, 
and $f$ is the state transition function. We focus on an $N$-stage 
horizon problem with a terminal cost $g(x_N)$ that depends on 
the terminal state $x_N$. We define the cost-to-go of a policy 
$\mu = \{u(x_0), u(x_1), \cdots, u(x_{N-1})\}$ starting from a state $x_k$ at 
time slot $k$ as 

$$J_k^{\mu}(x_k) = E\left\{\lambda^* - g(x_N) + \sum_{i=k}^{N-1} \left\{\lambda^* - g(x_i, u(x_i))\right\}\right\}. \qquad (27)$$

The optimal cost-to-go starting from a state $x_k$ in time slot $k$ is 
$J_k(x_k) = \inf_{\mu} J_k^{\mu}(x_k)$, and it satisfies the following recursive 
relationship 

$$J_k(x_k) = \inf_{u(x_k) \in U} E\left\{\lambda^* - g(x_k, u(x_k)) + J_{k+1}(f(x_k, u(x_k), \zeta_k))\right\}, \qquad (28)$$

with $k = 0, 1, \cdots, N - 1$ and the initial condition 
$J_N(x_N) = \lambda^* - g(x_N)$. We can also extend the definitions to 
infinite-horizon problems with minor modifications. 

An optimal policy could be obtained by calculating the 
optimal cost-to-go functions $J_k$, but this is prohibitively 
time-consuming for our problem. To reduce the computational 
complexity, we can adopt the rollout algorithm by replacing the 
optimal cost-to-go function $J_{k+1}$ in (28) with an 
approximation $\tilde{J}_{k+1}$. 

In the rollout algorithm, some known heuristic or 
suboptimal policy $\mu$, called the base policy, is used to 
calculate the approximating function $\tilde{J}_{k+1}$. The values of the 
approximate cost-to-go $\tilde{J}_{k+1}$ may be computed in a number 
of ways: by a closed-form expression, by an approximate 
off-line computation, or by an on-line computation. The improved 
policy is called the rollout policy based on $\mu$. It is a one-step 
lookahead policy (obtained by using (28) once), where we approximate 
the optimal cost-to-go on the right-hand side of (28) by the 
cost-to-go of the base policy. A more detailed description of 
the rollout algorithm can be found in [24], [25]. 
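The one-step lookahead idea can be illustrated with a generic Python sketch. It is not the paper's implementation: the function names and the callback interfaces (`controls`, `step`, `stage_cost`, `terminal_cost`, `base_policy`) are assumptions introduced here, and the cost-to-go of the base policy is approximated by on-line simulation, one of the options mentioned above. In the notation of (28), `stage_cost(x, u)` plays the role of $\lambda^* - g(x_k, u(x_k))$ and `step` plays the role of $f$ with its random disturbance.

```python
import random


def base_cost_to_go(x, k, n_stages, step, stage_cost, terminal_cost,
                    base_policy, rng):
    """Simulate the base policy from state x at stage k; return the total cost."""
    total = 0.0
    for _ in range(k, n_stages):
        u = base_policy(x)
        total += stage_cost(x, u)
        x = step(x, u, rng)  # one random transition of the system
    return total + terminal_cost(x)


def rollout_control(x, k, n_stages, controls, step, stage_cost, terminal_cost,
                    base_policy, runs=200, seed=0):
    """Pick the control minimizing stage cost plus simulated base-policy cost-to-go."""
    rng = random.Random(seed)
    best_u, best_q = None, float("inf")
    for u in controls(x):
        q = 0.0
        for _ in range(runs):
            x_next = step(x, u, rng)
            q += stage_cost(x, u) + base_cost_to_go(
                x_next, k + 1, n_stages, step, stage_cost,
                terminal_cost, base_policy, rng)
        q /= runs  # sample average approximates the expectation in (28)
        if q < best_q:
            best_u, best_q = u, q
    return best_u


if __name__ == "__main__":
    # Deterministic toy problem, for illustration only.
    u = rollout_control(
        x=0, k=0, n_stages=3,
        controls=lambda x: [0, 1],
        step=lambda x, u, rng: x + u,
        stage_cost=lambda x, u: abs(x + u - 2),
        terminal_cost=lambda x: 0.0,
        base_policy=lambda x: 0,
        runs=5,
    )
    print(u)
```

The cost of one decision grows with the number of candidate controls times the number of simulation runs times the remaining horizon, which is why the rollout policy is more expensive than the base policy it improves.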

B. Several Lemmas for Proving Propositions 5 and 6 

After defining the average revenue $g(N_{th}, \theta)$ and the 
expected average revenue $G(N_{th}, \theta)$ in Definition 3, we give 
the following intermediate lemmas to help illustrate the 
properties of $G(N_{th}, \theta)$ in terms of the first- and second-order 
derivatives. 

Lemma 1: For a fixed state $\theta$, $N_c(N_{th}, \theta)$ is a 
non-decreasing and concave function of the number of users $N_{th}$. 
Proof: Recall that all the users have the same completion 
probability $P_f$ when they are actively served. Thus we have 
$N_c(N_{th} + 1, \theta) \ge N_c(N_{th}, \theta)$, as having one more user means 
that it is possible to actively serve one more user and thus have 
one more normal session completion. Furthermore, we assume 
that under the same channel condition and over a period of 
time slots, the incremental number of served users per time slot 
is $\Delta_1$ when the number of users changes from $N_{th} - 1$ to $N_{th}$. 
Then $\Delta_2$, the incremental number of served users per time slot 
from $N_{th}$ to $N_{th} + 1$, should be no bigger than $\Delta_1$. This is 
because if $N_{th} + 1$ users can be allocated available channels, 
$N_{th}$ users could be allocated available channels in the same 
time slot. Therefore, we have $N_c(N_{th} + 1, \theta) - N_c(N_{th}, \theta) \le 
N_c(N_{th}, \theta) - N_c(N_{th} - 1, \theta)$, which means $N_c(N_{th}, \theta)$ is a 
non-decreasing and concave function of $N_{th}$. ■ 

Lemma 2: For a fixed state $\theta$, $N_d(N_{th}, \theta)$ is a 
non-decreasing and convex function of the number of users $N_{th}$. 
Proof: Having one more admitted user means 
a higher probability of a forced termination, i.e., $N_d(N_{th} + 
1, \theta) \ge N_d(N_{th}, \theta)$. Under the largest-delay-first channel 
allocation policy, define $\Delta_1 = N_d(N_{th}, \theta) - N_d(N_{th} - 1, \theta)$ 
and $\Delta_2 = N_d(N_{th} + 1, \theta) - 
N_d(N_{th}, \theta)$, and denote the additional user as $U_{ser}$. For convenience of discussion, we call the system 
with $N_{th} - 1$ users Case 1, the system with $N_{th}$ users 
Case 2, and the system with $N_{th} + 1$ users Case 3. In Case 
2, we divide the users into two parts: $U_{ser}$ and the other $N_{th} - 1$ users. 
In Case 3, we also divide the users into two parts: $U_{ser}$ and the other 
$N_{th}$ users. Then we define $N_d(N_{th} + 1, \theta) = N'_d(N_{th}, \theta) + 
N_d^{(3)}(U_{ser}, \theta)$ and $N_d(N_{th}, \theta) = N'_d(N_{th} - 1, \theta) + N_d^{(2)}(U_{ser}, \theta)$. 
Here $N_d^{(3)}(U_{ser}, \theta)$ and $N'_d(N_{th}, \theta)$ represent the corresponding 
parts of $N_d(N_{th} + 1, \theta)$ caused by the forced termination of 
$U_{ser}$ and of the other users in Case 3, respectively; $N_d^{(2)}(U_{ser}, \theta)$ 
and $N'_d(N_{th} - 1, \theta)$ represent the corresponding parts of 
$N_d(N_{th}, \theta)$ caused by the forced termination of $U_{ser}$ and of the other 
users in Case 2, respectively. On this basis, we further write 
$\Delta_2 = \Delta'_2 + \Delta''_2$, where $\Delta'_2 = N'_d(N_{th}, \theta) - N'_d(N_{th} - 1, \theta)$ 
and $\Delta''_2 = N_d^{(3)}(U_{ser}, \theta) - N_d^{(2)}(U_{ser}, \theta)$. 

In Case 2 and Case 3, we now exclude the user $U_{ser}$ from 
the system and assume the channels allocated to $U_{ser}$ are 
occupied by primary users. Then the above expression 
of $\Delta'_2$ illustrates the effect of increasing the number of other users from 
$N_{th} - 1$ to $N_{th}$. Comparing $\Delta'_2 = N'_d(N_{th}, \theta) - N'_d(N_{th} - 1, \theta)$ 
with $\Delta_1 = N_d(N_{th}, \theta) - N_d(N_{th} - 1, \theta)$, the difference is that 
in any time slot (on any sample path), the channel state in the 
$\Delta'_2$ case is always no better than that in the $\Delta_1$ case (as the extra 
user $U_{ser}$ may occupy an available channel). Therefore, in 
terms of the expected number of users forced to leave the 
system per time slot, the effect of the additional user on $\Delta'_2$ 
is no smaller than that on $\Delta_1$. This leads to $\Delta'_2 \ge \Delta_1$. Moreover, 
considering $U_{ser}$ from Case 2 to Case 3, we have $\Delta''_2 \ge 0$ 
under the largest-delay-first policy. From the above analysis, 
we get $\Delta_2 \ge \Delta_1$, i.e., $N_d(N_{th} + 1, \theta) - N_d(N_{th}, \theta) \ge 
N_d(N_{th}, \theta) - N_d(N_{th} - 1, \theta)$, which means $N_d(N_{th}, \theta)$ is a 
non-decreasing and convex function of $N_{th}$. ■ 

Lemma 3: For a fixed number of users $N_{th}$, if there are 
two states $\theta_1 = \{m, \omega_e^{(1)}\}$ and $\theta_2 = \{m, \omega_e^{(2)}\}$ such 
that $\omega_e^{(1)} > \omega_e^{(2)}$, we have $N_c(N_{th}, \theta_1) \ge N_c(N_{th}, \theta_2)$ and 
$N_d(N_{th}, \theta_1) \le N_d(N_{th}, \theta_2)$. 

Proof: The lemma follows directly from the definitions of 
$\omega_e^{(1)} > \omega_e^{(2)}$ in Definition 1 and of $N_c(N_{th}, \theta)$, $N_d(N_{th}, \theta)$ in 
Definition 3. If $\omega_e^{(1)} > \omega_e^{(2)}$, the user state $\omega_e^{(1)}$ can reach the 
user state $\omega_e^{(2)}$ under a proper channel condition and a control 
policy. Consider two systems with the initial states $\theta_1$ and $\theta_2$, 
respectively, that follow the same channel conditions over time 
and the same control policy. When a user is forced to leave 
the system (completes the connection, respectively) with $\theta_1$, in 
the system with $\theta_2$, there must be a user that is forced to leave 
(completes the connection or is forced to leave, respectively) 
in the current or an earlier time slot. Therefore, we get 
$N_c(N_{th}, \theta_1) \ge N_c(N_{th}, \theta_2)$ and $N_d(N_{th}, \theta_1) \le N_d(N_{th}, \theta_2)$ 
based on the definitions of $N_c(N_{th}, \theta)$ and $N_d(N_{th}, \theta)$. ■ 